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Abstract 

Standard sparse pseudo-input approximations to 
the Gaussian process (GP) cannot handle com¬ 
plex functions well. Sparse spectrum alternatives 
attempt to answer this but are known to over-fit. 
We suggest the use of variational inference for 
the sparse spectrum approximation to avoid both 
issues. We model the covariance function with 
a finite Fourier series approximation and treat it 
as a random variable. The random covariance 
function has a posterior, on which a variational 
distribution is placed. The variational distribu¬ 
tion transforms the random covariance function 
to fit the data. We study the properties of our 
approximate inference, compare it to alternative 
ones, and extend it to the distributed and stochas¬ 
tic domains. Our approximation captures com¬ 
plex functions better than standard approaches 
and avoids over-fitting. 


1 Introduction 


The Gaussian process (GP, [Rasmussen & Williams! |2006| ) 
is a powerful tool for modelling distributions over non¬ 
linear functions. It offers robustness to over-fitting, a 
principled way to tune hyper-parameters, and uncertainty 
bounds over the outputs. These properties are critical for 
tasks including non-linear function regression, reinforce¬ 
ment learning, density estimation, and mor e (jBrochu et alT 


201^ Rasmussen et al.j 2003 1 Engel et al. 2005 Titsias & 


Lawrence 2010| l. Butthe advantages of the Gaussian pro 
cess come with a great computational cost. Evaluating the 
GP posterior involves a large matrix inversion - for N data 
points the model requires 0{N^) time complexity. 


Many approximations to the GP have been proposed to re¬ 


duce the model’s time complexity. Quinonero-Candela & 


Rasmussen (2005i survey approaches relying on “sparse 


pseudo-input” approximations. In these, a small number of 
points in the input space with corresponding outputs (“in¬ 
ducing inputs and outputs”) are used to define a new Gaus¬ 
sian process. The new GP is desired to be as close as pos¬ 
sible to the GP defined on the entire dataset, and the matrix 
inversion is now done with respect to the inducing points 
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alone. These approaches are suitable for locally complex 
functions. The approximate model would place most of the 
inducing points in regions where the function is complex, 
and only a small number of points would be placed in re¬ 
gions where the function is not. Highly complex functions 
cannot be modelled well with this approach. 


Lazaro-Gredilla et al. (2010|l suggested an alternative ap¬ 


proximation to the GP model. In their paper they sug¬ 
gest the decomposition of the GP’s stationary covariance 
function into its Eourier series. The infinite series is then 
approximated with a finite one. They optimise over the 
frequencies of the series to minimise some divergence 
from the full Gaussian process. This approach was named 
a “sparse spectrum” approximation. This approach is 


closely related to the one suggested by Rahimi & Recht 


(2007| in the randomised methods community (random 
projections). In Rahimi & Rechtj ( 2007| l’s approach, the 
frequencies are randomised (sampled from some distribu¬ 
tion rather than optimised) and the Eourier coefficients are 
computed analytically. Both approaches capture globally 
complex behaviour, but the direct optimisation of the dif¬ 
ferent quantities often leads to some form of over-fitting (as 
reported in (Wilson et al. 2014| l for the SSGP and shown 
below for random projections). Similar over-fitting prob¬ 
lems observed with the sparse pseudo-input approximation 
were answered with variational inference (|Titsias[|2009]l. 


We suggest the use of variational inference for the sparse 
spectrum approximation. This allows us to avoid over¬ 
fitting while efficiently capturing globally complex be¬ 
haviour. We replace the stationary covariance function with 
a finite approximation obtained from Monte Carlo inte¬ 
gration. This finite approximation is a random variable, 
and conditioned on a dataset this random variable has an 
intractable posterior. We approximate this posterior with 
variational inference, resulting in a non-stationary finite 
rank covariance function. The approximating variational 
distribution transforms the covariance function to fit the 
data well. The prior from the GP model keeps the approxi¬ 
mating distribution from over-fitting to the data. 


Like in (Lazaro-Gredilla et al. 2010| l, we can marginalise 
over the Eourier coefficients. This results in approximate 
inference with 0{NK‘^ -\- K^) time complexity with N 
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data points and K inducing frequencies (components in the 
Fourier expansion). This is the same as that of the sparse 
pseudo-input and sparse spectrum approximations. We can 
further optimise a variational distribution over the frequen¬ 
cies reducing the time complexity to 0{NK‘^). This fac¬ 
torises the lower bound and allows us to perform distributed 
inference, resulting in 0{K) time complexity given a suffi¬ 
cient number of nodes in a distributed framework. We can 
approximate the latter lower bound and use random subsets 
of the dat a (mini batches) emplo ying stochastic variational 
inference (Hoffman et al. 2013 i. This results in 0{SK‘^) 
time complexity with S « N the size of the mini-batcnM 


In the experiments section we demonstrate the properties of 
our GP approximation and compare it to alternative approx¬ 
imations. We describe qualitative properties of the approx¬ 
imation and discuss how the approximation can be used 
to learn the covariance function by fitting to the data. We 
compare the approximation to the full Gaussian process, 
sparse spectrum GP, sparse pseudo-input GP, and random 
projections. We show that alternative approximations ei¬ 
ther over-fit or under-fit even on simple datasets. We em¬ 
pirically demonstrate the advantages of the variational in¬ 
ference in avoiding over-fitting by comparing the approxi¬ 
mation to the sparse spectrum one on audio data from the 
TIMIT dataset. We compare the stochastic optimisation to 
the non-stochastic one, and compare the performance to the 
sparse pseudo-input SVI. Finally, we inspect the model’s 
time accuracy trade-off and show that it avoids over-fitting 
as the number of parameters increases. 


2 Sparse Spectrum Approximation in Gaus¬ 
sian Process Regression 


We use Bochner’s theorem ( |Bochner]|1959| l to reformulate 
the covariance function in terms of its frequencies. Since 
our covariance function iT(x, y) is stationary, it can be 
represented as iT(x — y) for all x, y G Follow¬ 

ing Bochner’s theorem, Ff(x — y) can be represented as 
the Fourier transform of some finite measure a^p{w) with 
p(w) a probability density, 

K{x — y)= ( cr^p(w)e“^^™ 

J«.Q 


/ CT^p(w) cos(27rw^(x — y))dw (1) 

J’rQ 


since the covariance function is real-valued. 


This can be approximated as a finite sum with K terms 
using Monte Carlo integration, 

(^2 ^ 

K{x -y) ~ {2 ttwI ((x - Zfc) - (y - z^))) 

fc=i 

with Wfc ~ p(w) and z^ some Q dimensional vectors for 

'Python code for all inference algorithms is available at 
http://github.com/yaringal/VSSGP 


k = The points z^ act as inducing inputs, and 

will have corresponding inducing frequencies in our ap¬ 
proximation. For the sparse spectrum GP, these points take 
value 0. These will be explained in detail in a later section. 

Using identity [T] proved in the appendix we rewrite the 
terms above for every k as 

cos (27rWfc ((x - Zfc) - (y - z^))) 

^ V2 cos (27rw^ (x — Zfc) + 6) 

• x/^cos (27rw^(y — Zk) + b)db. 



This integral can again be approximated as a finite sum us¬ 
ing Monte Carlo integration. To keep the notation simple, 
we approximate the integral with a single sampl^for every 

iT(x-y) « ^ y2cos(27rWfc (x - z^) +bk) 

k=l 

■ V2 cos(27rWfc (y - Zk) + bk) 

=; K{x-y) 

with bk ~ Unif[0, 27r], defining the approximate covariance 
function K. We refer to {'Wk)k=i as inducing frequencies 
and to as phases, and denote u) = {'ffk,bk)k=i- 

Note that this integral could be approximated with any ar¬ 
bitrary number of samples instead. 

We denote X G the inputs and Y G the out¬ 

puts of a real-valued dataset with N data points. In Gaus¬ 
sian process regression we find the probability P(Y|X) 
with the assumption that the function generating Y is 
drawn from a Gaussian process. The full GP model is de¬ 
fined as (assuming stationary covariance function K{-, •)): 
F I X~AA(0,X(X,X)) 

Y I F ~ AA(F,t- 1I) 
with some precision hyper-parameter r. 

Using K instead as the covariance function of the Gaussian 
process yields the following generative model; 

Wfc - p(w), bk ~ Unif[0,27r], 

= {'^k,bk)k=i 
2 K 

K{x,y) = ^^V2cos (27rwfc(x- Zfc) -f bk) 

k^l 

■ V2cos (27rWfc(y - Zk) + bk) 

F I X,a; ~ A/'(0,K(X,X)) 

Y I F~AA(F,t-1I). 


^The above transformation and approximate integration are 
used in the randomised methods literature (“Random projec¬ 
tions”, [Rahimi & Rechtj 2007| l. It was shown to give better ap¬ 
proximation than Monte Carlo integration of eq. ^ Intuitively it 
is equivalent to a random phase shift for each basis function. 
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3 Random Covariance Functions 


itT is a deterministic covariance function of its inputs; K is 
a random finite rank covariance function. As such, we can 
find the conditional distribution of the covariance function 
given a dataset (more precisely, the conditional distribution 
of uj). This is a powerful view of this approximation - it 
allows us to transform the covariance function to ht the data 
well, while the prior keeps it from over-htting to the data. 
We will use K as our Gaussian process covariance function 
from now on, replacing K. This results in the following 
predictive distribution: 

p(Y|X) = yp(Y|F)p(F|a;,X)p(a;)da;dF. 

We can integrate this analytically for F and obtain 

p(Y|X)= J AA(Y;0,i^(X,X) + T-il)p(u;)da; 

but this involves the inversion of Ar(X, X) + t“^I, which 
does not allow us to integrate over uj (even variationally!). 
Instead, we introduce an auxiliary random variable. 


Denoting the 1 x K row vector 
</)(x,u;) = 


2(j2 

— cos (27rw^(x - Zk) H- bk) 


n K 




and the N x K feature matrix $ = [(/)(x„, we have 

k{X,X) = We rewrite p(YIX) as 

p(Y|X) = J AA(Y; 0, + T-^l)p{uj)duj. 

Following identity ( |Bishop|[2006 page 93, equations 2.113 
— 2.115) we introduce a K x 1 auxiliary random variable 
ad ^ JV{0, Ik) to the distribution inside the integral above. 


= J A/'(yd;$ad,T ^I)A/'(ad;0,Iif)dad, 

where yd is the d’th column of the N x D matrix Y. 
Writing A = [ad]d^i, the above is equivalent tcj^ 

p(Y|X) = J p{Y\A,X,u:)p{A)p{ij)dAdu. (2) 
We refer to A € as the Fourier coefficients. 


as well. Following proposition [T] in the appendix, given 
a sum of covariance functions with L components (with 
each corresponding to <1)^ an A x A matrix) we have $ = 
an A X LK matrix. 

As an example covariance function of this form consider 
the spectral mixture (SM) covariance function ( [Lind greii 
|2012t [Wilson & Adams| |2013] l. This covariance function 
has been used in the audio processing community since the 
’70s and was recently introduced to the machine learning 
community. It generalises many known covariance func¬ 
tions, such as the periodic covariance function, the auto¬ 
matic relevance determination (ARD) squared exponential 
(SE) covariance function, products of these and weighted 
sums of these products. 

We will continue the development of our method using this 
covariance function. Note however that our method is gen¬ 
eral and can be extended for other covariance functions 
as well. The spectral mixture covariance function with L 
components is given by 



with weights af, length-scales liq and periods p^^. We 

write Pi = [p~q]^^i and L, = diag([27r(q,]^^i). This co- 
variance function reduces to a sum of squared exponential 
(SE) covariance functions for piq = oo for all i and q. 

For p(w) composed of a single SM component, we fol¬ 
low proposition]^ in the appendix and perform a change of 
variables, resulting in p(w) a standard normal distribution 
with the parameters of p(w) now expressed in $ instead. 
For p(w) composed of several components, for each com¬ 
ponent i we get is an A X A matrix with elements 

/2cr2 _ _ 

Y -^cos (27r(L, Vfe -f Pi)^(x- Zfe) -f 6fc), 

where for simplicity, we index and bk with k = 
1,..., LK as a function of i. 


Regarding as parameters and optimising these values (in¬ 
tegrating over A) results in the sparse spectrum approx¬ 
imation (Lazaro-Gredilla et al. 20101. Regarding A as 
parameters and optimising these values (leaving uj con¬ 
stant) results in a method known as “random projections” 
( Rahimi & Recht] 2007| l. Related work to random pro¬ 
jections variationally integrates over the hyper-parameters 
while leaving uj constant ( |Tan et aLl|2013| l. 

We can extend the above to sums of covariance functions 


^This is equivalent to the weighted basis function interpreta¬ 
tion of the Gaussian process (Rasmussen & Williams 2006|l. 


4 Variational Inference 

The predictive distribution for an input point x* is given by 
p(y*|x*,X,Y) = J p{y*\jc*,A,uj)p{A,uj\X,Y)dAduj, 

( 3 ) 

withy* G The distribution p(A,a;|X,Y) cannot 

be evaluated analytically. Instead we dehne an approxi¬ 
mating variational distribution q{A, uj), whose structure is 
easy to evaluate. 

We would like our approximating distribution to be as close 
as possible to the posterior distribution obtained from the 
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full GP. We thus minimise the Kullback-Leibler divergence 
KL(<j(A,a;) |p(A,u;|X,Y)), 
resulting in the approximate predictive distribution 

q{y*\x*) = J p(y*|x*, A,u;)g(A,a;)dAda;. (4) 


Minimising the Kullback-Leibler divergence is equivalent 
to maximising the log evidence lower bound 


C : = 


q{A, uj) logp(Y|A, X, u;)dAdu; 


- KL(( 7 (A, w) I |p(A)p(a;)) (5) 

with respect to the variational parameters defining q{A,uj). 


We define a factorised variational distribution q{A, tv) = 
q{A)q{u}). We define q{uj) with a; = (wfc, to be a 

joint Gaussian distribution and a uniform distribution, 

Wfc ~ A/'(Aifc, Sfe), k = l,...,LK 

bk^\Jmf{ak,/3k), k = l,...,LK 

with Sfc diagonal, Q < ak < fik < Stt, and define q{A) = 
]\d=i (with Sid e by 

a.d ^ Af{nid,Sd), d=l,...,D 

with Sd diagonal. We evaluate the log evidence 
lower bound and optimise over {^k,^k,C(k, Pk}k=i’ 
{iiid, and {cTj, L*, to maximise Eq.|^ 

4.1 Evaluating the Log Evidence Lower Bound 


Given A and a;, we evaluate the probability of the d’th 
element, yd, as 

logp(yd|A,X,a;) = 

./V T 

- y log(27rT"^) - -(yd - $ad)^(yd - $ad). 

Note that y^ is an A x 1 vector, 4) is an A x LK matrix, 
and ad is an LK x 1 vector. 

We need to evaluate the expectations of yj$ad and 
aj$^$ad (both scalar values) under q{A)q{u>): 

E'q{A)q(ui){yd = Yd-^ 13 ( 1 *)) (^’)^'g(A) (S-d)) ( 6 ) 

and 

■E 5 (A) 9 (<..)(aJ$^$ad) = tr^£;g(^)($'^4>)Ag(A)(adaJ)^. 

(7) 


= e cos + 6 „fc). 

where fXk is the mean of q{wk) and is its covariance. 
We get 



• ( COs(/tfc X„fe + bnk)) (8) 

with the integration with respect to q{bk) trivial. 


Next we evaluate ($^$), an LK x LK matrix: 


N 2^2 

— ^ ^ yy 7i/(3(wi,6i,Wj,dj) ( COs(Wj '^ni E bni') 
n—1 


for i,j < LK. 


• cos(wJx,y + 6 „j)) (9) 


For i 7 ^ j, from independence we can break the expectation 
of each term into 

( COs(wf X„i + bni) C0s(wJx„j + bnj)) 

= ( COs(wf X„, + b„i)) 

( COs(Wj X„j + bnj)^ , 

and for i = j, 

■Eq(wi.bi)(cos(wfx„i + 6 ™)^) = ^ + 

1 -T - 

(cos(2/rf X™ + 26„i)) 

following identity!^ 

In conclusion, we obtained our optimisation objective: 

^ = log(27rr-i) - ^yjyd 

d^l ^ 

+ TydEg(^^){^)ind 
- ^tr(Aq(^)(4>'^4>)(sd + mdmJ))^ 

- KL(g(A)||p(A)) - KL(g(a;)|b(u;)). (10) 


The KL divergence terms can be evaluated analytically for 
the Gaussian and uniform distributions. 


The values Aq(A)(3-d) and Aq(A)(adaJ) are evaluated as 
Eq(^A){^d) — rUd, 

£',(A)(adaJ) = Sd + mdmj. 

Next we evaluate i? 5 (a.,) ($). Remember that $ depends on 
u) and that q{uj) = q{{'Wk,bk)k=i)- We write as shorthand 
x„fc := 27rL~^(x„ - z^) and b„k = bk + 27rpf (x„ - z*.) 
with component i appropriate to k. Following identity 
proved in the appendix, we have that the expectation of a 
single element in the vector with respect to q{'Wk) is 

( COS (w^X„fc + bnk)) 


4.2 Optimal variational distribution over A 

In the above we optimise over the variational parameters 
for A, namely rud and Sd for d < D. This allows us to 
attain a reduction in time complexity compared to previ¬ 
ous approaches and use stochastic inference, as will be ex¬ 
plained below. This comes with a cost, as the dependence 
between a; and A can render the optimisation hard. 

We can find the optimal variational distribution q{A) an¬ 
alytically, which allows us to optimise u} and the hyper¬ 
parameters alone. In propositionj^in the appendix we show 
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that the optimal variational distribution is given by 

q{ad) = A/'(S£;q(^)($'^)yd, r"^S) 
with S = (i?q(a,) 

The lower bound to optimise then reduces to 

“ Y log(27rr-i) - ^yjyd + ^ log(|r-^S|) 

d^l ^ 

-KL(g(a;)||p(a;)). (11) 


5 Distributed Inference and Stochastic Infer¬ 
ence 


Evaluating C in equation 10 requires 0{NK^) time com¬ 
plexity (for fixed Q, D, diagonal s^, and covariance func¬ 
tion with one component L = 1). This stems from the term 
- 2 l K X K matrix, where each element is 
composed of a sum over N. 


Following the ideas of ( |Gal et alT |2014| l we show that the 
approximation can be implemented in a distributed frame¬ 
work. Write 
1 

End = -- log(27rr“^) - -yndUnd + TyndEq(^){4’n)va.d 


- ^tr{Eq^^){(l)l(j)n){sd + mdinj)) 

with (/)„ = (/)(x„, tu). We can break the optimisation objec¬ 
tive in equation [^into a sum over N, 

D N 

^ = EE End - KL(g(A)||p(A)) - KL{q{u:)\\p{u;)). 

d=l n—1 

( 12 ) 


These terms can be computed concurrently on different 
nodes in a distributed framework, requiring 0{K^) time 
complexity in each iteration. We can further break the 
computation of End into a sum over K as well, thus re¬ 
ducing the time complexity to 0{K) with K inducing 
points. This is in comparison to distributed inference with 
sparse pseudo-input GPs which takes 0{K^) time com¬ 
plexity with K inducing points, resulting from the covari¬ 
ance matrix inversion. This is a major advantage, as em¬ 
pirical results suggest that in many real-world applications 
the number of inducing points should scale with the data. 


We can exploit the above representation and perform 
stochastic variational inference (SVI) by approximating the 
objective with a subset of the data, resulting in noisy gradi¬ 
ents ( Hoffman et al.|[2bl3| l. Here we use as our objective 

^ ~ TY E E 

I I d=lnes 

- KL(<z(A)||p(A)) - KL(q(^)|b(a;)). (13) 

with a mini-batch S of randomly selected points. This is an 


unbiased estimator to the lower bound. The time complex¬ 
ity of each iteration is 0{SK^) with S << N the size of 
the random subset, compared to 0{SK'^ + K^) of GP SVI 
using sparse pseudo-input approximation (|Hensman et al. 

|2m3l ). 


6 Predictive Distribution 

The approximate wedictive distribution for a point x* is 
given by equation H Denoting M = [m.d]d=i, we have 

i?,(y|x*)(y*) = i?9(<.)(0*)M (14) 

following proposition in the appendix. 

The variance of the predictive distribution is given by 
Var,(y.|,.)(y*) = T-ilc + T- (15) 

with T-ij = tr{Eqt^^){(j)^(j)^) • s^) • = j], following 

proposition]^ in the appendix (1 is the indicator function). 

When the optimal variational distribution over A is used, 
we have M = Y and for all i. 


1 Properties of the Approximate Model 


We have presented a variational sparse spectrum approxi¬ 
mation to the Gaussian process (VSSGP in short). We gave 
3 approximate models with different lower bounds: an ap¬ 
proximate model with an optimal variational distribution 


over A (equation 11 referred to as VSSGP), an approx¬ 


imate model with a factorised lower bound over the data 
points (equations [T^l^ referred to as factorised VSSGP - 
fVSSGP), and an approximation to the lower bound of the 
factorised VSSGP for use in stochastic optimisation over 


subsets of the data (equation 13 referred to as stochastic 
factorised VSSGP - sfVSSGP). 


The VSSGP model generalises on some of the GP approx¬ 
imations brought in the introduction. Fixing Efc at zero in 
our approximate model (as well as ak and jdk at 0 and 27r) 
and optimising only pk results in the sparse spectrum ap¬ 
proximation. Randomising p,k, we obtain the random pro¬ 
jections approximation (Rahimi & Recht 2007| l. Indeed, 
for Sfe = 0 and fixed phases we have that Eq^^^) (<I>^<I>) = 


E. 


q{ui) 


and Egj^^) ($) = $, and equa tion[TT] 


recovers equation 8 in (|Fazaro-Gredilla et al. 2010|l. 


The points act as inducing inputs with and bk acting 
as inducing frequencies and phases at these inputs. This 
is similar to the sparse pseudo-input approximation, but 
instead of having inducing values in the output space, we 
have the inducing values in the frequency domain. These 
are necessary to the approximation. Without these points 
(or equivalently, setting these to 0), the features would 
decay quickly for data points far from the origin (the fixed 
point 0). 
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The distribution over the frequencies is optimised to fit the 
data well. The prior is used to regulate the fit and avoid 
over-fitting to the data. This approximation can be used to 
learn covariance functions by fitting them to the data. This 
is similar to the ideas brought in ( |Duvenaud et al.| |2013| ) 
where the structure of a covariance function is sought by 
looking at possible compositions of these. This can give ad¬ 
ditional insight into the data. In ( |Duvenaud et al.[ 2013| l the 
structure of the covariance composition is used to explain 
the data. In the approximation presented here the spectrum 
of the covariance function can be used to explain the data. 


It is interesting to note that although the approximated co- 
variance function K{x,y) has to be stationary (i.e. it is 
represented as iT(x,y) = iT(x — y)), the approximate 
posterior is not. This is in contrast to the SSGP that re¬ 
sults in a stationary approximation. Furthermore, unlike 
the SSGP, our approximation is not periodic. This is one of 
the theoretical limitations of the sparse spectrum approx¬ 
imation. The limitation arises from the fact that the co- 
variance is represented as a weighted sum of cosines in the 
SSGP. In the our approximation this is avoided by decay¬ 
ing the cosines to zero. This and other properties of the 
approximation are discussed further in discussion in the 
appendix. 


8 Experiments 


We next study the properties of the VSSGP and compare 
it to alternative approximations, showing its advantages. 
We compare the VSSGP to the full Gaussian process (de¬ 
noted Full GP), the sparse spectrum GP approximation (de¬ 
noted SSGP), a sparse pseudo-input GP approximation (de¬ 
noted SPGP), and the random projections approximation 
(denoted RP). We compare the VSSGP to the fVSSGP and 
sfVSSGP that offer improved time complexity. We further 
compare sfVSSGP to the existing sparse pseudo-input GP 


approach used with SVI (denoted sSPGP, Hensman et al. 


|2013| ). We inspect the model’s time accuracy trade-off and 
show that it avoids over-fitting as the number of parameters 
increases. 


8.1 VSSGP Properties 


We evaluate the predictive mean and uncertainty of the 
VSSGP on the atmospheric CO 2 concentrations dataset de¬ 
rived from in situ air samples collected at Mauna Loa Ob¬ 
servatory, Hawaii ( Keeling et al.[ 2004| l. We fit the approx¬ 
imate model using a spectral mixture covariance function 
with two components initialised with periods [5, 00 ] and 
corresponding initial length-scales [0.1,1000] (resulting in 
a sum of SE x periodic and SE covariances). We ran¬ 
domised the phases following the Monte Carlo integration 
(instead of optimising a variational distribution on thes^ 
and initialise the frequencies at random. We use 10 induc- 


"^This seems to work better in practice. 



Figure 1. Predictive mean and uncertainty on the Mauna Loa CO 2 
concentrations dataset. In red is the observed function; in blue is 
the predictive mean plus/minus two standard deviations. In this 
example the approximating distribution is used with a spectral 
mixture covariance with two components [L = 2, K = 10). 


ing inputs for each component (K = 10), set the obser¬ 
vation noise precis ion to t = 10, a nd covariance noise to 
cr^ = 1. LBEGS (Zhu et al. |1997 1 was used to optimise 
the objective given in equation 11 and was run for 500 it¬ 
erations. 


Eigure[2 shows the predictive mean with the predictive un¬ 
certainty increasing far from the data. This is a property 
shared with the SE GR The covariance hyper-parameters 
optimise to periods of [9.8, 00 ], length-scales [0.09,54], 
and covariance noise [0.0043,5.7], correspondingly. The 
frequency with the smallest standard-deviation (highest 
confidence) for the first component is 1 (corresponding to 
a period of 1 year, capturing the short term behaviour). Eor 
the second component these are 0.0053, 0.00065 (corre¬ 
sponding to periods of 185 and 1536 years capturing the 
long term behaviour). 


8.2 Comparison to Existing GP Approximations 


We compare various GP approximations on the solar irra- 
diance dataset (Lean 2004| l. We scaled the dataset dividing 
by the data standard deviation, and removed 5 segments of 
length 20. We followed the experiment set-up of the previ¬ 
ous section and used the same initial parameters for all ap¬ 
proximate models. Instead of the SM covariance function 
we use a single SE setting its length-scale I = 1, and used 
50 inducing inputs. LBEGS was used for 1000 iterations. 
The RP model was run twice with two different settings; 
once following the same set-up of the other models, opti¬ 
mising over the model hyper-parameters (RPi), and once 
keeping all hyper-parameters fixed and setting the observa¬ 
tion noise precision to t = 100 with K = 500 inducing 
input^(RP2). 


Eigure shows qualitatively the predictive mean and un¬ 
certainty of the various approaches. SSGP and RP seem 
to over-fit the function using high frequencies with high 
confidence. SPGP seems to under-fit the function, but has 
accurate predictive mean and uncertainty at points where 
many inducing inputs lie (such as the flat region). VSSGP’s 
predictive mean resembles that of the full GP, but with in- 


^This follows the usual use of the model in the randomised 
methods community. We experimented with various values of r 
and decided to use 100. 
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(c) Random Projections (RP 2 , K = 500) 



(d) Full GP 


Solar 

SPGP 

SSGP 

RPi 

RP 2 

GP 

VSSGP 

Train 

0.23 

0.15 

0.32 

0.04 

0.08 

0.13 

Test 

0.61 

0.63 

0.65 

0.76 

0.50 

0.41 


Table 1. Imputation RMSE on both train and test sets, for the re¬ 
constructed solar irradiance dataset. All tests were done with the 
SE covariance function, and all sparse approximations use 50 in¬ 
ducing inputs (apart from RP 2 that uses K = 500). 

the full GP seems to get worse results than VSSGP. This 
might be because of the (slightly) larger learnt length-scale. 

8.3 From SSGP to Variational SSGP 

We use of variational inference in the VSSGP to avoid over¬ 
fitting to the data, a behaviour that is often observed with 
SSGP. To test this we perform a direct comparison of the 
proposed approximate model to SSGP on the task of au¬ 
dio signal imputation. For this experiment we used a short 
speech signal with 1000 samples taken from the TIMIT 
dataset ( |Garofolo et al.| |1993| ). We removed 5 segments 
of length 40 from the signal, and evaluated the imputation 
error (RMSE) of the predictive mean with K = 100 in¬ 
ducing points. We used the same experiment set-up as be¬ 
fore with a sum of 2 SE covariance functions with length- 
scales I = [2,10] and observation noise precision r = 1000 
matching the signal magnitude. LBEGS was run for 1000 
iterations. The experiment was repeated 5 times and the 
results averaged. 



Audio IK 

SSGP 

VSSGP 

Train 

0.0091 ± 0.0042 

0.0062 ±0.00048 

Test 

0.088 ±0.033 

0.034 ± 0.0043 


Table 3. Imputation RMSE on both train and test sets, for a speech 
signal segment of length IK (K = 100). 


Figure 2. Predictive mean and uncertainty on the reconstructed 
solar irradiance dataset with missing segments, for the GP and 
various GP approximations. In red is the observed function and in 
green are the missing segments. In blue is the predictive mean 
plus/minus two standard deviations of the various approxima¬ 
tions. All tests were done with the SE covariance function, and all 
sparse approximations use K = 50 inducing inputs (apart from 
RP 2 with K = 500). 

creased uncertainty throughout the space. Eurther, its un¬ 
certainty on the missing segments is smaller than that of the 
full GP (some frequencies have low uncertainty, thus used 
near the data). The full GP learnt length-scale is 4. VSSGP 
learnt a length-scale of 3, and SPGP learnt a length-scale 
of 5. SSGP and RPi learnt length-scales of 0.97,1.66, i.e. 
the hyper-parameter optimisation found a local minimum. 


Table shows the RMSE of the training set and test set 
for the audio data. SSGP seems to achieve a small training 
error but cannot generalise well to unseen audio segments. 
VSSGP attains a slightly lower training error, and is able to 
impute unseen audio segments with better accuracy. 

It is interesting to note that using the RMSE of the short- 
time Eourier transform of the original signal and the pre¬ 
dicted mean (STET, the common metric for audio imputa¬ 
tion, with 25ms frame size and a hop size of 12ms), the 
SSGP model attains a training error of 0.094 ± 0.05 and a 
test error of 0.55 ± 0.41. The VSSGP attains a training er¬ 
ror of 0.067 ± 0.0067 with a test error of 0.17 ± 0.022. Eor 
comparison, baseline performance of predicting 0 attains 
an error of 0.44 on the training set and an error of 0.38 on 
the test set. 


Table[T]gives a quantitative comparison of the different ap¬ 
proximations for the task of imputation. RMSE (root mean 
square error) of the approximate predictive mean on the 
missing segments was computed (test error), as well as the 
RMSE on the observed function (training error). Note that 


8.4 VSSGP, factorised VSSGP, and stochastic fac¬ 
torised VSSGP 

VSSGP, fVSSGP, and sfVSSGP all rely on different lower 
bounds to the same approximate model. Whereas VSSGP 
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Audio IK 

VSSGP 

fVSSGP 

sfVSSGP 

Train 

0.0062 ± 0.00048 (o.o 63 ±o.oo 68 ) 

0.0054 ± 0.00083 (o.o 55 ±o.oo 88 ) 

0.005 ± 0.003 (o.o 52 ±o. 03 i) 

Test 

0.034 ± 0.0043 (o.i 7 ±o. 022 ) 

0.038 ± 0.0049 (o. 22 ±o. 028 ) 

0.04 ± 0.0066 (o. 24 ±o.oo 89 ) 


Table 2. Imputation RMSE (and in smaller font STFT RMSE) on train and test sets, for a speech signal segment of length IK (K = 100). 


solves for the variational distribution over the Fourier coef¬ 
ficients analytically, fVSSGP optimises over these quanti¬ 
ties. This reduces the time complexity, but with the price of 
potentially worsened performance. sfVSSGP further em¬ 
ploys an approximation to the lower bound using random 
subsets of the data - following the idea that not all data 
points have to be observed for a good fit to be found. This 
assumption has the potential to hinder performance even 
further. We next assess these trade-offs. 

We repeated the experimental set-up of the previous sec¬ 
tion (and use the same RMSE for VSSGP). We optimise 
both fVSSGP and sfVSSGP for 5000 iterations instead of 
the 1000 of VSSGP. This is because the improved time 
complexity allows us to perform more function evaluations 
within the same time-frame. We optimise the fVSSGP 
lower bound with LBFGS, and the sfVSSGP lower bound 
with RMSPROP dTieleman & Hinton||20T2] l. RMSPROP 
performs stochastic optimisation with no need for learning- 
rate tuning - the learning rate changes adaptively based on 
the directions of the last two gradients. 

Table |2] shows the RMSE for the train and test sets. Both 
fVSSGP and sfVSSGP effectively achieve the same test set 
accuracy (taking the standard deviation into account). We 
also see a slight decrease in train set RMSE. 

8.5 Stochastic Variational Inference 


We compared sfVSSGP to the SPGP approximation with 
stochastic variational inference (sSPGP, ||Hensman et al.| 


20131. We used the same audio experiment as above, but 
with a signal of length 16000. 25 random segments of 
length 80 were removed from the signal. sSPGP’s time 
complexity (0{SK^ + K^) with mini-batch of size S and 
K inducing points) prohibits it from being used with a large 
number of inducing points. We therefore used 800 inducing 
points for sSPGP and 400 inducing inputs for each compo¬ 
nent in the covariance function of sfVSSGP {K = 400). 

The RMSE of sSPGP for the training set is 0.043 and for 
the test set is 0.034 (with a training time of 133 minutes 


0.40 
0.35 
£ 0.30 
c 0.25 
^ 0.20 
|o.. 
0.10 
0.05 



(a) Train error 


(b) Test error 


(c) Running time 


Figure 3. Mean and standard deviation for train error, test error, 
and running time, all as functions of the number of inducing 
points (K) for a speech signal segment of length 4K. 


using GPy ( |authors||20r2-2014| )). The RMSE of sfVSSGP 
for the training set is 0.016 and for the test set is 0.034 (with 
a training time of 48 minutes). Using the same audio im¬ 
putation metric as in the previous section, we get that the 
STET RMSE for the sSPGP on the training set is 0.54 and 
on the test set is 0.43. The STFT RMSE for the VSSGP on 
the training set is 0.18 with a test error of 0.3. Eor compar¬ 
ison again, baseline performance of predicting 0 attains an 
error of 0.52 on the training set and an error of 0.62 on the 
test set. 


8.6 Speed-Accuracy Trade-off 

We inspect the speed-accuracy trade-off of the approxima¬ 
tion (RMSE as a function of the number of inducing points) 
for the sfVSSGP approximation. We repeat the same audio 
experiment set-up with a speech signal with 4000 samples 
and evaluate the imputation error (RMSE) of the predictive 
mean with various numbers of inducing point. RMSPROP 
was run for 500 iterations. The experiment was repeated 5 
times and the results averaged. 

Eigurej^ shows that the approximation offers increased ac¬ 
curacy with an increasing number of inducing points. No 
further improvement is achieved with more than 400 induc¬ 
ing points. The time scales quadratically with the number 
of inducing points. Note that the approximation does not 
over-fit to the data as the number of parameters increases. 


9 Discussion 


Our approximate inference relates to the Bayesian neural 
network (Bayesian AW, Mackay[ 1992[ MacKay 1992| l. 
In the Bayesian NN a prior distribution is placed over 
the weights of an NN, and a posterior distribution (over 
the weights and outputs) is sought. The model offers a 
Bayesian interpretation to the classic NN, with the desired 
property of uncertainty estimates on the outputs. Inference 
in Bayesian NNs is generally hard, and approximations to 
the model are often used (Bishop 2006| pp 277-290). Our 
GP approximate inference relates Bayesian NNs and GPs, 
and can be seen as a method for tractable variational infer¬ 
ence in Bayesian NNs with a single hidden layer. 

Euture research includes the extension of our approxima¬ 
tion to deep GPs ( Damianou & Lawrence] 2012| l. We also 
aim to use the approximate model as a method for adding 
and removing units in an NN in a principled way. Lastly, 
we aim to replace the cosines in the Eourier expansion with 
alternative basis functions and study the resulting approxi¬ 
mate model. 











































Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs 


References 

authors. The GPy. GPy; A Gaussian process framework 
in python, http : / /github . com/Shef f ieldML/ 
GPy,2012-2014. 

Bishop, Christopher M. Pattern Recognition and Machine 
Learning (Information Science and Statistics). Springer- 
Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 
0387310738. 

Bochner, Salomon. Lectures on Fourier integrals. Num¬ 
ber 42. Princeton University Press, 1959. 

Brochu, Eric, Cora, Vlad M, and de Freitas, Nando. 
A tutorial on Bayesian optimization of expensive cost 
functions, with application to active user modeling and 
hierarchical reinforcement learning. arXiv preprint 
arXiv:1012.2599, 2010. 

Damianou, Andreas C and Lawrence, Neil D. Deep Gaus¬ 
sian processes. arXiv preprint arXiv:1211.0358, 2012. 

Duvenaud, David, Lloyd, James Robert, Grosse, Roger, 
Tenenbaum, Joshua B., and Ghahramani, Zoubin. Struc¬ 
ture discovery in nonparametric regression through com¬ 
positional kernel search. In Proceedings of the 30th 
International Conference on Machine Learning, June 
2013. 

Engel, Yaakov, Mannor, Shie, and Meir, Ron. Reinforce¬ 
ment learning with Gaussian processes. In Proceedings 
of the 22nd international conference on Machine learn¬ 
ing, pp. 201-208. ACM, 2005. 

Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl. Dis¬ 
tributed variational inference in sparse Gaussian process 
regression and latent variable models. In Ghahramani, 
Z., Welling, M., Cortes, C., Lawrence, N.D., and Wein¬ 
berger, K.Q. (eds.). Advances in Neural Information Pro¬ 
cessing Systems 27, pp. 3257-3265. Curran Associates, 
Inc., 2014. 

Garofolo, John S, Consortium, Linguistic Data, et al. 
TIMIT: acoustic-phonetic continuous speech corpus. 
Linguistic Data Consortium, 1993. 

Hensman, James, Fusi, Nicolo, and Lawrence, Neil D. 
Gaussian processes for big data. In Nicholson, Ann and 
Smyth, Padhraic (eds.), UAL AUAI Press, 2013. 

Hoffman, Matthew D., Blei, David M., Wang, Chong, and 
Paisley, John. Stochastic Variational Inference. Journal 
Of Machine Learning Research, 14:1303-1347, MAY 
2013. ISSN 1532-4435. 

Keeling, C.D., Whorf, T.P., and the Carbon Dioxide Re¬ 
search Group. Atmospheric C02 concentrations (ppmv) 


derived from in situ air samples collected at Mauna Loa 
Observatory, Hawaii. Scripps Institution of Oceanogra¬ 
phy (SIO), University of California, La Jolla, California 
USA 92093-0444, June 2004. 

Lazaro-Gredilla, Miguel, Quinonero-Candela, Joaquin, 
Rasmussen, Carl Edward, and Figueiras-Vidal, 
Anfbal R. Sparse spectrum Gaussian process re¬ 
gression. The Journal of Machine Learning Research, 
11:1865-1881,2010. 

Lean, J. Solar irradiance reconstruction. NOAA/NGDC 
Paleoclimatology Program, Boulder CO, USA, 2004. 
IGBP PAGES/World Data Center for Paleoclimatology. 
Data Contribution Series 2004-035. 

Lindgren, Georg. Stationary Stochastic Processes: Theory 
and Applications. CRC Press, 2012. 

Mackay, David. The evidence framework applied to clas¬ 
sification networks. Neural computation, 4(5):720-736, 
1992. 

MacKay, David JC. A practical Bayesian framework for 
backpropagation networks. Neural computation, 4(3): 
448-472, 1992. 

Quinonero-Candela, Joaquin and Rasmussen, Carl Ed¬ 
ward. A unifying view of sparse approximate Gaussian 
process regression. Journal of Machine Learning Re¬ 
search, 6:2005, 2005. 

Rahimi, Ali and Recht, Benjamin. Random features for 
large-scale kernel machines. In Advances in neural in¬ 
formation processing systems, pp. 1177-1184, 2007. 

Rasmussen, Carl Edward and Williams, Christopher K. I. 
Gaussian Processes for Machine Learning (Adaptive 
Computation and Machine Learning). The MIT Press, 
2006. ISBN 026218253X. 

Rasmussen, Carl Edward, Kuss, Make, et al. Gaussian pro¬ 
cesses in reinforcement learning. In NIPS, volume 4, pp. 
1, 2003. 

Tan, Linda S. L., Ong, Victor M. H., Nott, David J., and 
Jasra, Ajay. Variational inference for sparse spectrum 
Gaussian process regression. arXiv:1306.1999, 2013. 

Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop, 
COURSERA: Neural networks for machine learning, 
2012 . 

Titsias, Michalis and Lawrence, Neil. Bayesian Gaussian 
process latent variable model. Thirteenth International 
Conference on Artificial Intelligence and Statistics (AIS- 
TAT^), 6:844-851, 2010. 



Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs 


Titsias, Michalis K. Variational learning of inducing vari¬ 
ables in sparse Gaussian processes. In International 
Conference on Artificial Intelligence and Statistics, pp. 
567-574, 2009. 

Wilson, Andrew and Adams, Ryan. Gaussian process ker¬ 
nels for pattern discovery and extrapolation. In Proceed¬ 
ings of The 30th International Conference on Machine 
Learning, pp. 1067-1075, 2013. 

Wilson, Andrew, Gilboa, Elad, Cunningham, John P, and 
Nehorai, Arye. Fast kernel learning for multidimen¬ 
sional pattern extrapolation. In Advances in Neural In¬ 
formation Processing Systems, pp. 3626-3634, 2014. 

Zhu, Ciyou, Byrd, Richard H, Lu, Peihuang, and Nocedal, 
Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines 
for large-scale bound-constrained optimization. ACM 
Transactions on Mathematical Software (TOMS), 23(4): 
550-560, 1997. 




Improving the Gaussian Process Sparse Spectrum Approximation by Representing Uncertainty in Frequency Inputs 


A Appendix 

Identity 1. 

/.27r 2 

cos{x — y) = / —V 2 cos{x + b)V2cos{y + b)db 

Jo 27r 


= cos(2/i^x + 26) + i 


Proof. We first evaluate the term inside the integral. We 
have 

cos(x + 6 ) cos{y + 6 ) 

= (cos(a::) cos( 6 ) — sin(a;) sin( 6 )) 

• (cos)?/) cos( 6 ) — sin(?/) sin( 6 )) 

= (cos(a:) cos(?/)) cos^( 6 ) + (sin(a:) sin(?/)) sin^( 6 ) 

— (sin(a;) cos{y) + cos{x) sin{y)) sin( 6 ) cos( 6 ). 
Now, since / cos^( 6 )d 6 = 5 + | sin(26), as well as 
/ sin^( 6 )d 6 — ^ — |sin(26), and J sin( 6 ) cos( 6 )d 6 = 
— I cos(26), we have 

/.27I- j 

/ —v/^cos)! + 6 )-\/ 2 cos(?/+ 6 )d 6 

Jo 27r 

= — (cos(a:) cos(?/)(7r — 0) 

TT 

+ sin(a;) sin(?/)(7r — 0) 

— (sin(a:) cos{y) + cos{x) sm{y)) ■ 0 
= cos(a: — y) 

□ 

Identity 2. 

^^Ar(w;M.s) (cos(w^x + 6 )) = cos(/?^x + 6 ) 

Proof. We rely on the characteristic function of the Gaus¬ 
sian distribution to prove this identity. 

-E^Ar(w;M.s) (cos(w^x - 1 - 6 )) 

^g-ix^EXcos(^Tx^_/,) 

where 3fi( ) is the real part function, and the transition from 
the second to the third lines uses the characteristic function 
of a multivariate Gaussian distribution. □ 

Identity 3. 

(cos(w'^x -f bf) 

= cos(2/r^x -I- 26) -I- ^ 

Proof. Following the identity cos( 6 *)^ = 

(cos(w^x -h bf) 

= (cos( 2 w^x + 26)) -h ^ 


□ 


Proposition I. Given a sum of covariance functions with 
L components (with each corresponding to an N x K 
matrix) we have $ = an N x LK matrix. 

Proof. We extend the derivation of equation to sums of 
covariance functions. Given a sum of covariance functions 
with L components 

L 

K{,x,y) = '^afKi{x,y), 

following equation we have 

K{x,y) = Y' / (Tfp^{w) cos(27rw^(x - y))dw, 

where we write af instead of aaf for brevity (with af not 
having to sum to one). 

Following the derivations of equation for each compo¬ 
nent i in the sum we get an N x K matrix. Writing 
‘J’ = \^iVi=i an TV X LK matrix, we have that the sum 
of covariance matrices can be expressed with a single term 
after marginalizing F out, 

L 

^ -h T-^I = -h T-^1, 


i=l 


thus identity Instill holds. 


□ 




Proposition 2. Performing a change of variables to the 
SM covariance function with a single component, results in 
p(w) a standard normal distribution with covariance func¬ 
tion hyper-parameters expressed in $. 

Proof. The SM covariance function’s corresponding prob¬ 
ability measure p(w) is expressed as a mixture of Gaus- 
sians, 

L Q 

Pi^) = XI n 

i—1 q—1 

i=l 

with af summing to one. 

Following equation with the above p(w) we perform a 
change of variables to get, 

K{x,y) 

= X / cr»^-V(w';p,,L)"2)cos(27rw'^(x-y))dw' 

= X / cr?-V(w;0,I)cos(27r(L)-V-hp,)'^(x-y)) 
“T JkQ 
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• dw 


for w' = L ■ w + Pj. 

For each component i we get an N x K matrix with 
elements 

V it (27’'(L”^Wfe + Pi)^(x - Zfe) + bk ), 

where for simplicity, we index and bk with k = 
1 ,LiF as a function of i. □ 

Proposition 3. Let p{a) = A/^(0,1). The optimal distribu¬ 
tion q{a) solving 

J q{^)^ogp{y\a,X,u:)du:da 

- KL{q{a)\\p{a)) - KL{q{Lj)\\p{u})) 

is given by 

q{ad) = A/'(S£'q(^)($^)yd, r"^S) 

with S = 

The lower bound to optimise then reduces to 

D 




£ = ^ ( - y log( 27 rT 1 ) - 


,yd Vd 


+ -log(|r-iS|) 


- KL{q{u})\\p{uj)). 


Proof. Let 


We want to solve 

d(^ +A/(/g(a)da- 1 )) 


dg(a) 


= 0 


for some A. I.e. 


[ q{^) logp(y|a, X, u;)du; - log - 1 + A = 0. 

J pD 


This means that 


9(a) = 


„A-1 / q{ui) logp(y|a,X,ij)dt, 


’p(a) 


= exp ^ — -a^r(i?($^$) + r ^I)a 


(ry'^L;(T>)) 


a + ... 


and since q{a) is Gaussian, it must be equal to 

9 (a)=AA(Si?,(^)($r)y, r-^S) 

with S = (£q(a;) 7)“^ 

Writing p(a) and q(a) explicitly and simplifying results in 
the required lower bound. 


□ 

Proposition 4. Denoting M = we have 

-E'q(y*|x*) (y )= £ 9 ( 0 ^) (<(’*) M. 

Proof. The d’th output of the mean of the distribution is 
given by (writing </>* = ^(x*,a;)) 

Eq{vi\^-){yl) = j A,a;)9(A,a;)dAda;dy; 

= y ( 0 *arf) 9 (A,a;)dAda; 

= J (j).,q{uj)duj J aaq{A)dA 

~ -£g(aj) 

which can be evaluated analytically following equation 

□ 

Proposition 5. The variance of the predictive distribution 
is given by 

Larq(y.|x.)(y*) = t~1d + 

+ X) ^£ 9 ( 1 ,.) (</>*) )M 

with T'jj = tr{Eq(^){(lyf ■ Si) ■ l[i = j]. 

Proof. The raw second moment of the distribution is given 
by (remember that y* is a 1 x 73 row vector) 

i?,(y.|x.)((y*)^(y*)) 

= y yy*)^(y*) 7 '(y*|x*, A,u;)dyy 9 (A,a;)dAda; 

— J (CoVp(y»|x*,A,uj)(y ) 

+ ^^p(y*|x*,A,i,.)(y*)^-E'p(y.|x*,A,u.)(y*))9(A,a;)dAda; 

= T~1d +L^ 9 (A)g(a;)(A^(;if())*A). 

Now, for i j between 1 and 73, 

( Eq{A)q{ljj) (a 4>*A^ j = £’ij(A)(j(uj) (a, fa.aj'j 
V y i.i 




and for i = j between 1 and D, 
f ^q{A)q{uj) (-^ 4^^ j ^q{A)q{uj) 

+ tr^7;9(<^)((/)^(/),) • 

following equation]^ 

Taking the difference between the raw second moment and 
the outer product of the mean we get that the variance of 
the predictive distribution is given by 
Var 9 (y.|x.)(y*) = r”il£, + 4' 

+ {Eg(^^){(ll(j)^) - £’9(0,) (<(>*)"^£9(1,.) (<(>*))M 
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with (j)^) ■ s^) • t[i = j], □ 

Discussion 1. We discuss some of the key properties of the 
VSSGP, fVSSGP, and sfVSSGP. Due to space constraints, 
this discussion was moved to the appendix. 

Unlike the sparse pseudo-input approximation, where the 
variational uncertainty is over the locations of a sparse 
set of inducing points in the output space, the uncer¬ 
tainty in our approximation is over a sparse set of func¬ 
tion frequencies. As the uncertainty over a frequency (E^) 
grows, the exponential decay term in the expectation of 
<1) decreases, and the expected magnitude of the feature 
([(^'g(a;)(‘&))n,fc]^=i) tends to zero for points x„ far from 
Zfe. Conversely, as the uncertainty over a frequency de¬ 
creases, the exponential decay term increases towards one, 
and the expected magnitude of the feature does not dimin¬ 
ish for points x„ far from z^. 

With the predictive uncertainty in equation[T^we preserve 
many of the GP characteristics. As an example, consider 
the SE covariance functior0 In full GPs the variance in¬ 
creases towards cr^ + far away from the data. This 
property is key to Bayesian optimisation for example where 
this uncertainty is used to decide what action to take given 
a GP posterior. 

With the SE covariance function, our expression for 
(j)if contains an exponential decay term exp(—i(x„ — 
Zfe)^Efc(x„ — Zfc)). This term tends to zero as x„ diverges 
from Zfc. Eor x„ far away from z^ for all k we get that the 
entire matrix $ tends to zero, and that tends 

to ^Ife. 

Eor fVSSGP, equation [^then collapses to 

Varg(y.|x*)(y*) = t-^Id -f T'' 
with = j]). 

This term leads to identical predictive variance to that of the 
full GP when A is hxed and follows the prior. It is larger 
than the predictive variance of a full GP when > 1 — 
on average, and smaller otherwise. 

Unlike the SE GP, the predictive mean in the VSSGP with 
a SE covariance function does not tend to zero quickly far 
from the data. This is because the model can have high 
confidence in some frequencies, driving the inducing fre¬ 
quency variances (E^) to zero. This in turn requires x„—z^ 
to be much larger for the exponential decay term to tend to 
zero. The frequencies the model is conhdent about will be 
used far from the data as well. 

Unlike the SSGP, the approximation presented here is not 
periodic. This is one of the theoretical limitations of the 
sparse spectrum approximation (although in practice the 

OGiven by exp ( - i ) 


period was observed to often be larger than the range of the 
data). The limitation arises from the fact that the covariance 
is represented as a weighted sum of cosines in SSGP. In the 
approximation we present here this is avoided by decaying 
the cosines to zero. 


It is interesting to note that although our approximated co- 
variance function K{x,y) has to be stationary (i.e. it can 
be represented as A(x, y) = K{x — y)), the approximate 
posterior is not. This is because stationarity entails that for 
all X it must hold that K(x,x) = K{x — x) = K{0). 
But for X)) = we have that the 

diagonal terms depend on x; 


K 






fc=l 


Eq[bk) ( ^nfc)) ■ 

This is in comparison to the SSGP approximation, where 
the approximate model is stationary. 


It is also interesting to note that the lower bound in equation 
[T 0 |is equivalent to that of equation[^for non-diagonal. 
Eor Sd diagonal the lower bound is looser, but offers im¬ 
proved time complexity. 


The use of the factorised lower bound allows us to save on 
the expensive computation of A for small updates of u). 
Intuitively, this is because small updates in u} would result 
in small updates to A. Thus solving for A analytically at 
every time point without re-using previous computations 
is very wasteful. Optimising over A to solve the linear 
system of equations (given uj) allows us to use optimal A 
from previous steps, adapting it accordingly. 

Also, even though it is possible to analytically integrate 
over A, we can’t analytically integrate u). This is because 
UJ appears inside a cosine inside an exponent in equation 
We can’t solve for uj analytically either in the equation 
preceding equation]^ This is again because uj appears in¬ 
side a cosine (unlike A which appears in a quadratic form 
in that equation). 


Einally, we can approximate our approach to achieve a 
much more scalable implementation by only using the K' 
nearest inducing inputs for each data point. This is follow¬ 
ing the observation that for short length-scales and large E, 
the features will decay to zero exponentially fast with the 
distance of the data points from the inducing inputs. 





